
fix: Make XML parsing more robust with regex fallback (#455) #473

Open
vivekvar-dl wants to merge 1 commit into SylphAI-Inc:main from vivekvar-dl:fix-xml-parsing-issue-455


@vivekvar-dl

Summary

This PR addresses issue #455 by making XML parsing in the TGD optimizer more robust when handling malformed output from LLMs like Gemini Flash 2.5.

Problem

The current XML parser (CustomizedXMLParser in tgd_optimizer.py) fails completely when LLMs produce malformed XML, leading to loss of the proposed_variable and other important fields. Users reported this was more frequent with Gemini Flash 2.5.

Solution

This PR implements a multi-layered approach to XML parsing:

  1. XML Sanitization - Removes invalid control characters and handles CDATA sections before parsing
  2. Regex Fallback - When ET.fromstring fails, falls back to regex-based extraction
  3. Better Text Extraction - Improved handling of nested XML elements and text content
  4. Graceful Error Recovery - Returns extracted content instead of error placeholders
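The layered flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `parse_response`, `extract_with_regex`, and the `FIELDS` tuple are hypothetical, and only `proposed_variable` is a field name confirmed by the description.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical field list; only "proposed_variable" is named in the PR text.
FIELDS = ("reasoning", "method", "proposed_variable")


def extract_with_regex(text, fields=FIELDS):
    """Tolerant fallback: grab each field's content without requiring valid XML."""
    result = {}
    for name in fields:
        tag = re.escape(name)
        # Content runs to the closing tag, or to the end of the text if unclosed.
        pattern = rf"<{tag}\b[^>]*>(.*?)(?:</{tag}\s*>|\Z)"
        match = re.search(pattern, text, re.DOTALL)
        if match:
            result[name] = match.group(1).strip()
    return result


def parse_response(text, fields=FIELDS):
    """Try the strict XML parser first; use the regex fallback only on failure."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return extract_with_regex(text, fields)
    result = {}
    for name in fields:
        node = next(root.iter(name), None)
        if node is not None:
            # itertext() also collects text that sits inside nested child elements.
            result[name] = "".join(node.itertext()).strip()
    return result
```

One limitation of this sketch: an unclosed field captures everything through the end of the text, including any tags that follow it, so a production fallback would likely want a stricter stop condition.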

Changes

  • Added _sanitize_xml() method to clean invalid characters and extract CDATA content
  • Added _extract_with_regex() fallback parser using regex patterns
  • Enhanced get_element_text() helper to handle nested elements properly
  • Improved error handling with informative logging via log.warning() and log.info()
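As a rough illustration of the sanitization step, the sketch below strips control characters that XML 1.0 forbids and unwraps CDATA sections. It is an assumption about how such a helper could look, not the PR's `_sanitize_xml()` itself; in particular, escaping the unwrapped CDATA content (so the result stays parseable) is a choice made here, and the function name `sanitize_xml` is hypothetical.

```python
import re
from xml.sax.saxutils import escape

# C0 control characters that XML 1.0 does not allow (tab, LF, and CR are legal).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
_CDATA_SECTION = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def sanitize_xml(text):
    """Remove illegal control characters and replace CDATA sections with their content."""
    text = _INVALID_XML_CHARS.sub("", text)
    # Assumption: escape &, <, > in the unwrapped content so strict parsing can still succeed.
    return _CDATA_SECTION.sub(lambda m: escape(m.group(1)), text)
```

Running the sanitizer before `ET.fromstring` means inputs whose only problem is a stray control character never need the regex fallback at all.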

Testing

Tested with various malformed inputs:

  • ✅ Unclosed tags
  • ✅ Invalid control characters
  • ✅ CDATA sections
  • ✅ Missing wrapper elements
  • ✅ Nested XML in content
  • ✅ Completely broken XML

All test cases now successfully extract the proposed_variable field.
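The "Nested XML in content" case is worth a concrete look, because `Element.text` alone returns only the text before the first child element. The helper below is a hypothetical sketch (the name `inner_xml` is not from the PR) showing one way to recover the full content with its inner markup preserved, using only standard `xml.etree.ElementTree` calls.

```python
import xml.etree.ElementTree as ET


def inner_xml(elem):
    """Return an element's full content, keeping the markup of nested children."""
    # ET.tostring() serializes each child together with its tail text.
    children = "".join(ET.tostring(child, encoding="unicode") for child in elem)
    return (elem.text or "") + children


elem = ET.fromstring("<proposed_variable>Use <b>bold</b> text</proposed_variable>")
text_only = elem.text                    # stops at the first child element
flattened = "".join(elem.itertext())     # all text, nested tags dropped
preserved = inner_xml(elem)              # all text, nested tags kept
```

Whether to flatten or preserve nested tags depends on whether the proposed variable is expected to contain markup of its own; the PR description does not say which behavior `get_element_text()` implements.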

Backward Compatibility

✅ Fully backward compatible - well-formed XML still parses via the standard XML parser. The regex fallback only activates when XML parsing fails.

Fixes #455

- Add XML sanitization to remove invalid control characters
- Handle CDATA sections properly by extracting content
- Implement regex-based fallback when ET.fromstring fails
- Improve text extraction to handle nested XML elements
- Add comprehensive error recovery for malformed LLM output

This addresses issue SylphAI-Inc#455 where XML parsing was failing on malformed
output from Gemini Flash 2.5. The parser now gracefully falls back
to regex extraction when strict XML parsing fails, ensuring that
the proposed_variable and other fields are still extracted correctly.

The changes include:
1. _sanitize_xml() method to clean invalid characters and CDATA
2. _extract_with_regex() fallback parser using regex patterns
3. Enhanced get_element_text() to handle nested elements
4. Better error handling with informative logging

Tested with various malformed inputs including unclosed tags,
invalid characters, CDATA sections, and completely broken XML.


Development

Successfully merging this pull request may close these issues.

Issues with the XML parsing prompt outputs
